For this project, I wanted to explore trends in student academic performance in relation to other aspects of their academic careers. To do this, I will be using a data set from Kaggle containing data collected from college students in Dallas, TX. Some schools these students attend include Southern Methodist University, University of Texas at Arlington, Dallas College, and more. I was motivated to explore this data because as college students, it can be beneficial to us to see how patterns in other students’ behaviors and statistics align with their academic success. Learning more about what is related to academic success can help us understand what can make a difference in our college experience.
As I progressed with this project, I encountered a few things that needed to be addressed.
1. This is not the initial dataset I intended to use. The first one was created from synthetic data and included similar variables about academic performance and career performance. After performing multiple analyses, I ultimately decided to search for a different dataset that was not synthetic. There appeared to be little to no variation between any of the variables, with each one almost perfectly normally distributed and randomly paired with others (e.g., High School and College GPAs had no correlation in any way). For these reasons, I expanded my search and chose this data set due to its similar variables and realistic data collection.
2. Not all of the variables from the original data set are included in this project. I decided to only include the variables that were defined on Kaggle, as about half of them were not described and I did not want to assume anything about what the values could mean. Therefore, the data set went from having 44 variables to 26. Even with the variables defined, some units are unclear and binary values such as gender are not clear. I tried to research more about the data from its original source outside of Kaggle, but did not find any additional information.
3. I could not find any missing values that needed to be cleaned from the data. I rounded each value with a decimal to the nearest hundredth for ease of use, but required no other method of cleaning the data.
Despite some of these difficulties, exploring this dataset helped me learn more about student life and engagement, as well as data manipulation.
Graphs
SES The socioeconomic status of the students is divided into three separate categories – High, Medium, and Low. Most students are split between the Medium and Low class, and only 20% fall under High class. I found this surprising because higher education is typically expensive, and I would have expected more students to be in the High category.
Ethnicity Half of the students are White, while the other half is divided among the rest of the listed ethnicities with Hispanic being the second largest at 19.9%. This is not proportional to the demographic breakdown of Texas as a whole, where about 40% of the population is Hispanic, 40% White, 12% African American, and 8% other (Data USA).
Age Ages of the students range from 18 to 29. The distribution is almost completely uniform, with a little more than 5,000 students at each age. This helps us understand the scope of the data collection and suggests that the sample includes both undergraduate and graduate students.
Gender and Location As previously mentioned, gender was coded in a binary that was never clearly defined (it is unclear which gender the 1s and 0s represent). Because of this, it is not beneficial to discuss the distribution of gender. As for location, every entry was from Dallas, TX, which aligns with the sample collection.
Graphs
Discussion
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5015 2.1625 2.4967 2.4987 2.8339 4.0000
GPA appears approximately normally distributed with an IQR of 0.67. The median and mean fall almost exactly on top of each other, right around 2.5. There are multiple outliers on both sides of the distribution. Most of the values are concentrated in the center peak.
SES
High Low Medium
12240 24617 24488
Average GPA by SES Group
# A tibble: 3 × 2
# Groups: SES [3]
SES avgSESGPA
<chr> <dbl>
1 High 2.50
2 Low 2.50
3 Medium 2.50
The GPA distribution is very similar across socioeconomic statuses. All groups are very representative of the sample as a whole with little variation.
While it is not necessary that different SES groups have different trends in GPA, it is peculiar to see how uniform the results are.
Ethnicity
African American   Asian   Hispanic   Other   White
            9333    6140      12193    3160   30519
Average GPA by Ethnicity
# A tibble: 5 × 2
# Groups: Ethnicity [5]
Ethnicity avgethGPA
<chr> <dbl>
1 African American 2.50
2 Asian 2.50
3 Hispanic 2.50
4 Other 2.50
5 White 2.50
GPA is also very similar for each ethnicity. There do not appear to be any notable differences among the groups.
Like SES, it is unexpected to see little to no variation between ethnicities.
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 21.00 24.00 23.51 26.00 29.00
Average GPA by Age
# A tibble: 12 × 2
# Groups: Age [12]
Age avgAgeGPA
<dbl> <dbl>
1 18 2.50
2 19 2.50
3 20 2.50
4 21 2.50
5 22 2.50
6 23 2.50
7 24 2.50
8 25 2.50
9 26 2.50
10 27 2.50
11 28 2.50
12 29 2.50
We see the same trend with age, where every age reported has a very similar distribution.
Major_Field_of_Study
Arts Business Engineering Law Science
6139 9342 24388 3092 18384
Average GPA by Major
# A tibble: 5 × 2
# Groups: Major_Field_of_Study [5]
Major_Field_of_Study avgmajorGPA
<chr> <dbl>
1 Arts 2.50
2 Business 2.50
3 Engineering 2.50
4 Law 2.50
5 Science 2.50
Every field of study has a similar distribution with a mean of 2.5.
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 11.60 15.04 15.05 18.38 34.81
There does not appear to be any correlation between GPA and Study Hours per Week. The distribution for study hours per week peaks around 15 with many outliers that go above approximately 27 hours per week.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2211 0.6324 0.6999 0.6996 0.7674 1.0000
There does not appear to be any correlation between GPA and Resource Access Score. I would expect students with a higher resource access score to have higher GPAs, but the data does not reflect this.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.4651 0.5996 0.5980 0.7343 1.0000
Like many of the other variables we looked at, the Stress Indicator Score did not appear to have any relationship with GPA. Stress levels were pretty consistent across Fields of Study.
Across all Fields of Study, there do not appear to be any notable differences in any variable investigated. This includes SES, Ethnicity, Course Difficulty, Learning Satisfaction, Instructor Rating, Stress Levels, Resource Access, and Study Hours per Week. These findings are quite surprising. It is not expected to see such consistency across the board for every single Field of Study.
Frequency Table:Major_Field_of_Study
Arts Business Engineering Law Science
6139 9342 24388 3092 18384
Below is a corrgram of all quantitative variables in the data set. There is no linear relationship between any of the variables studied.
There were some sizable limitations in this study. As mentioned in the introduction, there were some problems with the initial data set which I attributed to it being synthetic data. However, even after choosing this new data set with real data, similar problems arose. Every calculation I attempted showed little to no variation across variables. This is extremely unusual. Real-life data is not usually so “perfect” or consistent. Even though this is unexpected and some results do not make much sense (e.g., study hours appear unrelated to GPA), this is what the data set reflects.
Kaggle also did not define all of the variables completely or adequately, so that led to some difficulty in calculations and interpretations of the data. Even though the original data source was cited on Kaggle, it was difficult to find where the data came from and reach the raw data set.
To conclude, there is no way for us to know exactly why the results appear so uniform, or whether this reflects a problem with the data or genuinely little variation across variables. This project was still helpful in building my skills in R and data exploration.
For future studies, we could do a deeper dive into potential data sets that may offer more in-depth analyses of the variables of interest. It may have been more beneficial to take multiple data sets with different information and combine them. After this analysis, we do not know which factors might contribute to academic success, and the data would suggest that none of these variables have any significant effect on it.
Exploring different databases or gathering our own data could help us understand more about the variables and where the data came from. Looking at reports from areas around the country instead of just Dallas would be interesting as well.
---
title: "Academic Performance in Dallas Students"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: yeti
navbar-bg: "blue"
orientation: columns
vertical_layout: fill
source_code: embed
---
```{r}
library(flexdashboard)
library(tidyverse)
library(DT)
library(plotly)
library(vioplot)
library(corrgram)
# Load the full Kaggle export (44 variables)
academic_og<-read_csv("academic_performance_analytics_complete.csv")
# Keep only the 26 variables that are described on Kaggle
academic<-select(academic_og,-c(Class_Size:Final_Exam_Scores, Learning_Material_Satisfaction,Assignment_Completion_Rate,Academic_Support_Utilization, Academic_Performance_Category:Engagement_Level))
# attach() copies the columns as they are at this point, so bare names such as
# GPA in later chunks refer to the values before rounding
attach(academic)
# Round the continuous variables to two decimal places
academic<-academic %>% mutate(across(c(GPA, Study_Hours_per_Week, Previous_Academic_Performance, Instructor_Rating, Resource_Access_Score, Peer_Interaction_Score, Stress_Indicator_Score), ~round(.x, digits=2)))
```
Introduction
===
Column {data-width=500}
---
### <font size=4><span Style = "color:green">Overview</span></font>
For this project, I wanted to explore trends in student academic performance in relation to other aspects of their academic careers. To do this, I will be using a data set from [Kaggle](https://www.kaggle.com/datasets/datasetengineer/edu-students-data-dallas/data) containing data collected from college students in Dallas, TX. Some schools these students attend include Southern Methodist University, University of Texas at Arlington, Dallas College, and more. I was motivated to explore this data because as college students, it can be beneficial to us to see how patterns in other students' behaviors and statistics align with their academic success. Learning more about what is related to academic success can help us understand what can make a difference in our college experience.
### <font size=4><span Style = "color:green">Research Questions</span></font>
- What factors are related to higher GPA?
- What variation spans over different fields of study?
Column {.tabset data-width=500}
---
### Data
```{r}
datatable(academic[1:500,], rownames=FALSE, colnames=c("Timestamp","Student ID","Age","Gender","Ethnicity","SES","Location","Enrollment Status","GPA","Attendance Rate","Study Hours per Week","Extracurricular Participation","Course Load","Major Field of Study","Previous Academic Performance","Course Type","Course Difficulty","Instructor Rating","Learning Style Compatibility","Career Alignment Indicator","Library Usage Frequency","Study Group Participation","Resource Access Score","Peer Interaction Score","Stress Indicator Score","Learning Satisfaction Level"), options=list(pageLength=20))
```
### Variables
Descriptions of each variable:
- **Timestamp:** The date and time when the data was recorded, on an hourly basis.
- **Student_ID:** A unique identifier assigned to each student in the dataset.
- **Age:** The age of the student at the time of data collection.
- **Gender:** The gender of the student (encoded as binary or categorical values).
- **Ethnicity:** The ethnic background of the student, based on available demographic data.
- **SES:** Socioeconomic status indicator reflecting the student's background.
- **Location:** The geographic location where the data was collected, specifically in the Dallas-Fort Worth area.
- **Enrollment_Status:** Status indicating whether the student is enrolled full-time or part-time.
- **GPA:** Grade Point Average, representing the student's academic performance.
- **Attendance_Rate:** The rate at which the student attends classes, expressed as a percentage.
- **Study_Hours_per_Week:** The number of hours the student spends studying each week.
- **Extracurricular_Participation:** A score indicating the level of participation in extracurricular activities.
- **Course_Load:** The number of courses a student is taking during a given period.
- **Major_Field_of_Study:** The student's major.
- **Previous_Academic_Performance:** A historical indicator of the student's academic performance.
- **Course_Type:** The type of course (e.g., lecture, lab, seminar).
- **Course_Difficulty:** A qualitative rating of the overall difficulty of the course.
- **Instructor_Rating:** A rating reflecting the student's satisfaction with the instructor's teaching.
- **Learning_Style_Compatibility:** A score indicating how well the student's preferred learning style aligns with the course's format.
- **Career_Alignment_Indicator:** Measures the alignment between the course content and the student's career goals.
- **Library_Usage_Frequency:** The frequency with which the student accesses the library or online learning resources.
- **Study_Group_Participation:** Participation in study groups or collaborative learning activities.
- **Resource_Access_Score:** An indicator of the student's access to academic resources.
- **Peer_Interaction_Score:** A measure of the student's interaction with peers.
- **Stress_Indicator_Score:** A score reflecting the student's reported stress level.
- **Learning_Satisfaction_Level:** Indicator of the student's satisfaction with their learning experience.
### About the Dataset
As I progressed with this project, I encountered a few things that needed to be addressed.
**1.** This is not the initial dataset I intended to use. The first one was created from synthetic data and included similar variables about academic performance and career performance. After performing multiple analyses, I ultimately decided to search for a different dataset that was not synthetic. There appeared to be little to no variation between any of the variables, with each one almost perfectly normally distributed and randomly paired with others (e.g., High School and College GPAs had no correlation in any way). For these reasons, I expanded my search and chose this data set due to its similar variables and realistic data collection.
**2.** Not all of the variables from the original data set are included in this project. I decided to only include the variables that were defined on Kaggle, as about half of them were not described and I did not want to assume anything about what the values could mean. Therefore, the data set went from having 44 variables to 26.
Even with the variables defined, some units are unclear and binary values such as gender are not clear. I tried to research more about the data from its original source outside of Kaggle, but did not find any additional information.
**3.** I could not find any missing values that needed to be cleaned from the data. I rounded each value with a decimal to the nearest hundredth for ease of use, but required no other method of cleaning the data.
Despite some of these difficulties, exploring this dataset helped me learn more about student life and engagement, as well as data manipulation.
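
The two cleaning steps described above (checking for missing values and rounding to the nearest hundredth) can be sketched on a small example. This is only an illustration: the `toy` data frame and its values are made up, and only the column names are borrowed from the data set.

```r
library(dplyr)

# Hypothetical toy data standing in for the real CSV (values are invented)
toy <- data.frame(
  GPA = c(2.4967, 3.1234, NA),
  Study_Hours_per_Week = c(15.0449, 11.6, 18.3812),
  SES = c("Low", "High", "Medium")
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Round every numeric column to two decimal places; text columns are untouched
toy_rounded <- toy %>% mutate(across(where(is.numeric), ~ round(.x, 2)))
print(toy_rounded)
```

A column of all zeros from `colSums(is.na(...))` would match the "no missing values" observation for the real data.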
Demographics
===
Column {.tabset data-width=600}
---
<font size=4><span Style = "color:green">Graphs</span></font>
### SES
```{r}
X<-table(academic$SES)
percent<-round(100*X/sum(X), 1)
pie_labels<-paste(percent,"%",sep="")
pie(X,main="Distribution of Socioeconomic Status",labels=pie_labels, col=c("#fbb4ae","#b3cde3","#ccebc5"))
legend("topright",c("High","Low","Medium"), cex=0.7, fill=c("#fbb4ae","#b3cde3","#ccebc5"))
```
### Ethnicity
```{r}
H<-table(academic$Ethnicity)
percent<-round(100*H/sum(H), 1)
pie_labels<-paste(percent,"%",sep="")
pie(H,main="Distribution of Ethnicity",labels=pie_labels, col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"))
legend("topright",c("African American","Asian","Hispanic","Other","White"), cex=0.5, fill=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"))
```
### Age
```{r}
academic %>% ggplot(aes(x=Age))+
geom_histogram(fill="#decbe4",color="black",breaks=c(17,18,19,20,21,22,23,24,25,26,27,28,29))+
labs(title="Distribution of Age",
x="Age in Years")
```
Column {data-width=400}
---
### <font size=4><span Style = "color:green">Discussion</span></font>
**SES**
The socioeconomic status of the students is divided into three separate categories -- High, Medium, and Low. Most students are split between the Medium and Low class, and only 20% fall under High class. I found this surprising because higher education is typically expensive, and I would have expected more students to be in the High category.
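
As a quick check on the "only 20%" figure, the percentages can be recomputed from the SES counts reported in the frequency table on the GPA page (High 12240, Low 24617, Medium 24488). The vector name `ses_counts` is only for this sketch.

```r
# Counts copied from the SES frequency table reported in this dashboard
ses_counts <- c(High = 12240, Low = 24617, Medium = 24488)

# Convert counts to percentages of the whole sample, rounded to one decimal
ses_pct <- round(100 * prop.table(ses_counts), 1)
print(ses_pct)
```

This works out to roughly 20% High, 40% Low, and 40% Medium.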
**Ethnicity**
Half of the students are White, while the other half is divided among the rest of the listed ethnicities with Hispanic being the second largest at 19.9%. This is not proportional to the demographic breakdown of Texas as a whole, where about 40% of the population is Hispanic, 40% White, 12% African American, and 8% other ([Data USA](https://datausa.io/profile/geo/texas)).
**Age**
Ages of the students range from 18 to 29. The distribution is almost completely uniform, with a little more than 5,000 students at each age. This helps us understand the scope of the data collection and suggests that the sample includes both undergraduate and graduate students.
**Gender and Location**
As previously mentioned, gender was coded in a binary that was never clearly defined (it is unclear which gender the 1s and 0s represent). Because of this, it is not beneficial to discuss the distribution of gender. As for location, every entry was from Dallas, TX, which aligns with the sample collection.
GPA
===
Column {.tabset data-width=650}
---
<font size=4><span Style = "color:green">Graphs</span></font>
### GPA Histogram
```{r}
academic %>%
ggplot(aes(x=GPA)) +
geom_histogram(fill="#ccebc5", color="black")+
labs(title="Distribution of Student GPA (Histogram)",
x="GPA",
y="# of Students")->GPAhist
ggplotly(GPAhist)
```
### GPA Boxplot
```{r}
boxplot(GPA, main="Distribution of Student GPA (Boxplot)",ylab="GPA", col="#b3cde3")
```
### SES
```{r}
boxplot(GPA~SES, main="Distribution of GPA by SES",xlab="SES",ylab="GPA", col=c("#b3cde3","#ccebc5","#decbe4"))
```
### Ethnicity
```{r}
par(mar=c(5,2,4,1))
boxplot(GPA~Ethnicity, main="Distribution of GPA by Ethnicity",xlab="Ethnicity",ylab="GPA", col=c("#b3cde3","#ccebc5","#decbe4","#fbb4ae","#fed9a6"))
```
### Age
```{r}
boxplot(GPA~Age, main="Distribution of GPA by Age",col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"))
```
### Field of Study
```{r}
par(mar=c(5,3,4,1))
vioplot(GPA~Major_Field_of_Study,main="Distribution of Student GPA by Field of Study",xlab="Field of Study", ylab="GPA", col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"))
```
### Study Hours
```{r}
ggplot(academic, aes(x=Study_Hours_per_Week,y=GPA)) + geom_point(col="#fbb4ae")+geom_smooth(col="black")+labs(title="GPA by Study Hours per Week",x="Study Hours per Week",y="GPA")
boxplot(Study_Hours_per_Week, horizontal = TRUE, main="Study Hours per Week", xlab="Hours", col="#fed9a6")
```
### Resources
```{r}
ggplot(academic, aes(x=Resource_Access_Score, y=GPA))+geom_point(col="#b3cde3")+geom_smooth(col="black")+labs(title="GPA by Resource Access Score", x="Resource Access Score",y="GPA")
```
### Stress
```{r}
ggplot(academic,
aes(x=GPA, y=Stress_Indicator_Score, color=Major_Field_of_Study))+geom_point()+
facet_wrap(~Major_Field_of_Study)+labs(title="Stress Indicator Score by GPA and Major",x="GPA",y="Stress Indicator Score")+scale_color_brewer(palette="Set3")
```
```{r}
ggplot(academic, aes(x=GPA, y=Attendance_Rate))+geom_point()+geom_smooth()
```
Column {.tabset data-width=350}
---
<font size=4><span Style = "color:green">Discussion</span></font>
### GPA Distribution
Summary Statistics:
```{r}
summary(GPA)
```
GPA appears approximately normally distributed with an IQR of 0.67. The median and mean fall almost exactly on top of each other, right around 2.5. There are multiple outliers on both sides of the distribution. Most of the values are concentrated in the center peak.
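
The IQR quoted above is the gap between the reported third and first quartiles (2.8339 and 2.1625). The sketch below redoes that subtraction and then shows the same calculation on simulated GPAs. The simulated values are an assumption for illustration, not the real data.

```r
# IQR from the quartiles reported in the summary statistics
reported_iqr <- 2.8339 - 2.1625
print(round(reported_iqr, 2))

# Simulated GPAs (assumption: roughly normal, clipped to the 0-4 scale)
set.seed(1)
gpa <- pmin(pmax(rnorm(1000, mean = 2.5, sd = 0.5), 0), 4)

# IQR() matches the difference between the 75th and 25th percentiles
q <- quantile(gpa, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
print(c(IQR = IQR(gpa), manual = iqr_manual))

# A small mean-median gap is one informal sign of a symmetric distribution
print(mean(gpa) - median(gpa))
```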
### SES
Frequency Table:
```{r}
table(SES)
```
Average GPA by SES Group
```{r}
# mean() belongs in summarize(); inside group_by() it returns the overall mean for every group
summarize(group_by(academic, SES),
avgSESGPA=mean(GPA))
```
The GPA distribution is very similar across socioeconomic statuses. All groups are very representative of the sample as a whole with little variation.
While it is not necessary that different SES groups have different trends in GPA, it is peculiar to see how uniform the results are.
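
When every group reports exactly the same mean, one thing worth ruling out is where the `mean()` call sits. The sketch below uses a made-up four-row data frame (`toy`) to show how a mean computed inside `group_by()` differs from one computed inside `summarize()`.

```r
library(dplyr)

# Hypothetical toy data: two SES groups with clearly different GPAs
toy <- data.frame(
  SES = c("High", "High", "Low", "Low"),
  GPA = c(3.0, 3.5, 2.0, 2.5)
)

# mean() inside group_by(): computed once on the whole column, so every
# group carries the overall mean (2.75) as an extra grouping key
inside <- toy %>%
  group_by(SES, avg = mean(GPA)) %>%
  summarize(.groups = "drop")
print(inside)

# mean() inside summarize(): computed separately within each SES group
outside <- toy %>%
  group_by(SES) %>%
  summarize(avg = mean(GPA), .groups = "drop")
print(outside)
```

In the first version both groups show 2.75. In the second, High is 3.25 and Low is 2.25.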
### Ethnicity
Frequency Table:
```{r}
table(Ethnicity)
```
Average GPA by Ethnicity
```{r}
# mean() belongs in summarize(); inside group_by() it returns the overall mean for every group
summarize(group_by(academic, Ethnicity),
avgethGPA=mean(GPA))
```
GPA is also very similar for each ethnicity. There do not appear to be any notable differences among the groups.
Like SES, it is unexpected to see little to no variation between ethnicities.
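
A more formal way to ask whether group means differ is a one-way ANOVA. The sketch below runs one on simulated data. The group labels and GPA values are invented, and this test was not part of the original analysis.

```r
# Simulated data (assumption): three groups drawn from the same distribution
set.seed(42)
toy <- data.frame(
  Ethnicity = rep(c("A", "B", "C"), each = 50),
  GPA = rnorm(150, mean = 2.5, sd = 0.5)
)

# One-way ANOVA: does mean GPA differ across the groups?
fit <- aov(GPA ~ Ethnicity, data = toy)
anova_table <- summary(fit)[[1]]
print(anova_table)

# p-value for the group effect (first row of the table)
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```

A large p-value would agree with the visual impression that the groups do not differ.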
### Age
Summary Statistics:
```{r}
summary(Age)
```
Average GPA by Age
```{r}
# mean() belongs in summarize(); inside group_by() it returns the overall mean for every group
summarize(group_by(academic, Age),
avgAgeGPA=mean(GPA))
```
We see the same trend with age, where every age reported has a very similar distribution.
### Field of Study
Frequency Table:
```{r}
table(Major_Field_of_Study)
```
Average GPA by Major
```{r}
# mean() belongs in summarize(); inside group_by() it returns the overall mean for every group
summarize(group_by(academic, Major_Field_of_Study),
avgmajorGPA=mean(GPA))
```
Every field of study has a similar distribution with a mean of 2.5.
### Study Hours
Summary Statistics:
```{r}
summary(Study_Hours_per_Week)
```
There does not appear to be any correlation between GPA and Study Hours per Week. The distribution for study hours per week peaks around 15 with many outliers that go above approximately 27 hours per week.
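
The point where the boxplot starts marking outliers can be estimated from the reported quartiles (11.60 and 18.38) with the usual 1.5 × IQR rule. This is a rough sketch: `boxplot()` works from hinges rather than quartiles, though the two should be nearly identical in a sample this large.

```r
# Quartiles copied from the Study Hours summary statistics
q1 <- 11.60
q3 <- 18.38
iqr <- q3 - q1

# Points beyond these fences are drawn as outliers under the 1.5 * IQR rule
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))
```

The upper fence comes out near 28.5 hours, in the same neighborhood as the visual estimate above. The lower fence (about 1.4) falls below the minimum of 5 hours, so no low outliers would be expected.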
### Resources
Summary Statistics:
```{r}
summary(Resource_Access_Score)
```
There does not appear to be any correlation between GPA and Resource Access Score. I would expect students with a higher resource access score to have higher GPAs, but the data does not reflect this.
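
The "no correlation" impression from the scatterplot could be backed up with a Pearson correlation test. Here is a minimal sketch on simulated data: the values are made up, with only the column names borrowed from the data set.

```r
# Simulated data (assumption): two unrelated variables
set.seed(7)
n <- 500
toy <- data.frame(
  Resource_Access_Score = runif(n),
  GPA = rnorm(n, mean = 2.5, sd = 0.5)
)

# Pearson correlation with a test of the null hypothesis r = 0
res <- cor.test(toy$GPA, toy$Resource_Access_Score)
print(res$estimate)
print(res$p.value)
```

An estimate close to 0 with a large p-value would support reading the plot as showing no linear relationship.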
### Stress
Summary Statistics:
```{r}
summary(Stress_Indicator_Score)
```
Like many of the other variables we looked at, the Stress Indicator Score did not appear to have any relationship with GPA. Stress levels were pretty consistent across Fields of Study.
Fields of Study
===
Column {.tabset data-width=650}
---
### SES
```{r}
ggplot(academic, aes(x=Major_Field_of_Study, fill=SES))+ geom_bar(position="fill")+scale_fill_brewer(palette="Set3")+labs(title="SES by Field of Study",x="Field of Study")->SESFoS
ggplotly(SESFoS)
```
### Ethnicity
```{r}
ggplot(academic, aes(x=Major_Field_of_Study, fill=Ethnicity))+ geom_bar(position="fill")+scale_fill_brewer(palette="Set3")+labs(title="Ethnicity by Field of Study",x="Field of Study")->EthFoS
ggplotly(EthFoS)
```
### Course Difficulty
```{r}
ggplot(academic, aes(x=Major_Field_of_Study, fill=Course_Difficulty))+ geom_bar(position="fill")+scale_fill_brewer(palette="Set3")+labs(title="Course Difficulty by Field of Study",x="Field of Study")->CourseFoS
ggplotly(CourseFoS)
```
### Learning Satisfaction
```{r}
ggplot(academic, aes(x=Major_Field_of_Study, fill=Learning_Satisfaction_Level))+ geom_bar(position="fill")+scale_fill_brewer(palette="Set3")+labs(title="Learning Satisfaction Level by Field of Study",x="Field of Study")->LSLFoS
ggplotly(LSLFoS)
```
### Instructor Rating
```{r}
par(mar=c(5,4,4,0))
boxplot(Instructor_Rating~Major_Field_of_Study,col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"), main="Instructor Rating by Field of Study",xlab="Major Field of Study", ylab="Instructor Rating")
```
### Stress
```{r}
par(mar=c(5,4,4,0))
boxplot(Stress_Indicator_Score~Major_Field_of_Study,col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"), main="Stress Levels by Field of Study",xlab="Major Field of Study", ylab="Stress Indicator Score")
```
### Resources
```{r}
par(mar=c(5,4,4,0))
boxplot(Resource_Access_Score~Major_Field_of_Study,col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"), main="Resource Access Score by Field of Study",xlab="Major Field of Study", ylab="Resource Access Score")
```
### Study Hours
```{r}
par(mar=c(5,4,4,0))
boxplot(Study_Hours_per_Week~Major_Field_of_Study,col=c("#fbb4ae","#b3cde3","#ccebc5","#decbe4","#fed9a6"), main="Study Hours per Week by Field of Study",xlab="Major Field of Study", ylab="Study Hours")
```
Column {data-width=350}
---
### <font size=4><span Style = "color:green">Discussion</span></font>
Across all Fields of Study, there do not appear to be any notable differences in any variable investigated. This includes SES, Ethnicity, Course Difficulty, Learning Satisfaction, Instructor Rating, Stress Levels, Resource Access, and Study Hours per Week. These findings are quite surprising. It is not expected to see such consistency across the board for every single Field of Study.
Frequency Table:
```{r}
table(Major_Field_of_Study)
```
Correlations
===
Below is a corrgram of all quantitative variables in the data set. There is no linear relationship between any of the variables studied.
###
```{r}
corracademic<-select(academic, c("Age","GPA","Attendance_Rate","Study_Hours_per_Week", "Course_Load","Previous_Academic_Performance","Instructor_Rating","Library_Usage_Frequency","Resource_Access_Score","Peer_Interaction_Score","Stress_Indicator_Score"))
corracademic<-corracademic %>% rename(
Attendance=Attendance_Rate,
StudyHRs=Study_Hours_per_Week,
Courses=Course_Load,
Performance=Previous_Academic_Performance,
Instructor=Instructor_Rating,
Library=Library_Usage_Frequency,
Resources=Resource_Access_Score,
Peers=Peer_Interaction_Score,
Stress=Stress_Indicator_Score
)
corrgram(corracademic, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, main="Correlation of All Quantitative Variables")
```
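
A numeric correlation matrix is a useful companion to the corrgram, since it puts a number on how close to zero each relationship is. The sketch below uses simulated columns. The values are invented and the names simply mirror the shortened labels above.

```r
# Simulated data (assumption): three unrelated quantitative variables
set.seed(3)
toy <- data.frame(
  GPA = rnorm(200, mean = 2.5, sd = 0.5),
  StudyHRs = rnorm(200, mean = 15, sd = 5),
  Stress = runif(200)
)

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(toy), 2)
print(cor_mat)

# Largest absolute correlation between two different variables
max_off_diag <- max(abs(cor_mat[upper.tri(cor_mat)]))
print(max_off_diag)
```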
Conclusion
===
Column {data-width=300}
---
### <font size=4><span Style = "color:green">Limitations</span></font>
There were some sizable limitations in this study. As mentioned in the introduction, there were some problems with the initial data set which I attributed to it being synthetic data. However, even after choosing this new data set with real data, similar problems arose. Every calculation I attempted showed little to no variation across variables. This is extremely unusual. Real-life data is not usually so "perfect" or consistent. Even though this is unexpected and some results do not make much sense (e.g., study hours appear unrelated to GPA), this is what the data set reflects.
Kaggle also did not define all of the variables completely or adequately, so that led to some difficulty in calculations and interpretations of the data. Even though the original data source was cited on Kaggle, it was difficult to find where the data came from and reach the raw data set.
To conclude, there is no way for us to know exactly why the results appear so uniform, or whether this reflects a problem with the data or genuinely little variation across variables. This project was still helpful in building my skills in R and data exploration.
Column {data-width=300}
---
### <font size=4><span Style = "color:green">Future Directions</span></font>
For future studies, we could do a deeper dive into potential data sets that may offer more in-depth analyses of the variables of interest. It may have been more beneficial to take multiple data sets with different information and combine them. After this analysis, we do not know which factors might contribute to academic success, and the data would suggest that none of these variables have any significant effect on it.
Exploring different databases or gathering our own data could help us understand more about the variables and where the data came from. Looking at reports from areas around the country instead of just Dallas would be interesting as well.
Column {data-width=300}
---
### <font size=4><span Style = "color:green">About the Author</span></font>
My name is Audrey DeGregorio. I am a Psychology major with minors in Data Analytics and Family Development. I graduate from the University of Dayton in Spring 2026.
On campus, I work in Dr. O'Mara Kunz's research lab and as a TA for the Psychology Department. I am also the President of Active Minds, UD's mental health club, a mentor for Big Brothers Big Sisters, and a member of Pi Beta Phi.
After undergraduate, I hope to pursue my Ph.D. in Experimental Psychology and ultimately become a researcher and professor.
[Handshake](https://udayton.joinhandshake.com/profiles/yejc8k)
### <font size=4><span Style = "color:green">References</span></font>
[Dallas Student Data Set](https://www.kaggle.com/datasets/datasetengineer/edu-students-data-dallas/data)
[Texas Demographics](https://datausa.io/profile/geo/texas)